---
id: llamaindex
title: LlamaIndex
sidebar_label: LlamaIndex
---

## Quick Summary

LlamaIndex is a data framework for LLMs that facilitates the ingestion of data from various sources such as APIs, databases, and PDFs, and indexes it for later retrieval in RAG-based LLM applications.

:::tip
DeepEval streamlines the process of **evaluating your RAG LlamaIndex applications with just a few lines of code**.
:::

## Evaluating LlamaIndex

To begin evaluating your LlamaIndex application, first call `trace_llama_index` and set `auto_eval` to True. Then, define your `target_model` (LLM application you wish to evaluate), and run `auto_evaluate` on your model and list of metrics.

:::caution Important
Call `trace_llama_index` before defining your model, especially before initializing `VectorStoreIndex` or `QueryEngine`.
:::

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.integrations import trace_llama_index
from deepeval.auto_evaluate import auto_evaluate

trace_llama_index(auto_eval=True)

# define your chatbot
def target_model(input: str):
    ....

auto_evaluate(target_model, metrics=[AnswerRelevancyMetric(), FaithfulnessMetric()])
```

There are **2 required** and 20 optional parameters when running `auto_evaluate`:

**Required Parameters:**

- `target_model`: the model you wish to evaluate, which can either be an asynchronous or synchronous function. The function must accept exactly 1 argument (`input: string`) and return a string as its response.
- `metrics`: a list of metrics of type BaseMetric.

:::info
The remaining **20 optional parameters** can be broken down into evaluation, synthesizer, async, and caching parameters. These allow you to customize any part of the evaluation pipeline, from dataset generation to evaluation.
:::

**Optional Evaluation Parameters:**

- [Optional] `hyperparameters`: a dict of type `dict[str, Union[str, int, float]]`. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI.
- [Optional] `identifier`: a string that allows you to better identify your test run on Confident AI.
- [Optional] `skip_on_missing_params`: a boolean which when set to `True`, skips all metric executions for test cases with missing parameters. Defaulted to `False`.
- [Optional] `ignore_errors`: a boolean which when set to `True`, ignores all exceptions raised during metrics execution for each test case. Defaulted to `False`.
- [Optional] `verbose_mode`: a optional boolean which when **IS NOT** `None`, overrides each [metric's `verbose_mode` value](metrics-introduction#debugging-a-metric). Defaulted to `None`.
- [Optional] `show_indicator`: a boolean which when set to `True`, shows the evaluation progress indicator for each individual metric. Defaulted to `True`.
- [Optional] `print_results`: a boolean which when set to `True`, prints the result of each evaluation. Defaulted to `True`.

**Optional Synthesizer Parameters:**

- [Optional] `synthesizer_model`: a string specifying which of OpenAI's GPT models to use for generation, **OR** [any custom LLM model](metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to `gpt-4o`.
- [Optional] `filtration_config`: an instance of type `FiltrationConfig` that allows you to [customize the degree of which goldens are filtered](synthesizer-introduction#filtration-quality) during generation. Defaulted to the default `FiltrationConfig` values.
- [Optional] `evolution_config`: an instance of type `EvolutionConfig` that allows you to [customize the complexity of evolutions applied](synthesizer-introduction#evolution-complexity) during generation. Defaulted to the default `EvolutionConfig` values.
- [Optional] `styling_config`: an instance of type `StylingConfig` that allows you to [customize the styles and formats](synthesizer-introduction#styling-options) of generations. Defaulted to the default `StylingConfig` values.
- [Optional] `include_expected_output`: a boolean which when set to `True`, will additionally generate an `expected_output` for each synthetic `Golden`. Defaulted to `True`.
- [Optional] `max_goldens_per_context`: the maximum number of goldens to be generated per context. Defaulted to 2.

**Optional Async Parameters:**

- [Optional] `synthesizer_async_mode`: a boolean which when set to `True`, enables **concurrent generation of goldens**. Defaulted to `True`.
- [Optional] `run_async`: a boolean which when set to `True`, enables concurrent evaluation of test cases **AND** metrics. Defaulted to `True`.
- [Optional] `max_concurrent`: an integer that determines the maximum number of test cases that can be ran (and goldens that can be generated) in parallel at any point in time. You can decrease this value if your evaluation model is running into rate limit errors. Defaulted to 100.
- [Optional] `throttle_value`: an integer that determines how long (in seconds) to throttle the evaluation of each test case. You can increase this value if your evaluation model is running into rate limit errors. Defaulted to 0.

**Optional Cache Parameters:**

- [Optional] `cache_dataset`: a boolean which when set to `True`, writes generated goldens to **DISK** and uses cached goldens for evaluation. Defaulted to `True`.
- [Optional] `write_cache`: a boolean which when set to `True`, uses writes test run results to **DISK**. Defaulted to `True`.
- [Optional] `use_cache`: a boolean which when set to `True`, uses cached test run results instead. Defaulted to `False`.

## How it Works?

`auto_evaluate` automatically scans your LlamaIndex application to extract document nodes. It then generates synthetic data tailored to your application from these nodes, creates LLM responses to these goldens, converts them into test cases, and runs evaluations using the DeepEval metrics you defined.

<div
  style={{
    display: "flex",
    alignItems: "center",
    justifyContent: "center",
  }}
>
  <img
    src="https://deepeval-docs.s3.amazonaws.com/autoeval_llama.svg"
    style={{
      height: "auto",
      maxHeight: "800px",
      marginTop: "20px",
      marginBottom: "40px",
    }}
  />
</div>

:::note  
`auto_evaluate` simplifies the evaluation of your LlamaIndex application. For further customization of your evaluation pipeline, refer to the sections below.  
:::

## Customizing LlamaIndex Evaluations

### Creating Test Cases

Lets use this RAG application built using Llamaindex as an example:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

# Consult the LlamaIndex docs if you're unsure what this does
documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(documents)
rag_application = index.as_query_engine()
```

Instead of calling `trace_llama_index`, you can query your RAG application and evaluate each response using `deepeval`'s metrics, extracting all necessary outputs and retrieval contexts for each given input to your LlamaIndex application.

:::info
This is especially useful if you find that `auto_evaluate` is not accurately capturing your intended documents and correctly populating the `retrieval_context`.
:::

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
...

# An example input to your RAG application
user_input = "What is LlamaIndex?"

# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = rag_application.query(user_input)

# Process the response object to get the output string
# and retrieved nodes
if response_object is not None:
    actual_output = response_object.response
    retrieval_context = [node.get_content() for node in response_object.source_nodes]

# Create a test case and metric as usual
test_case = LLMTestCase(
    input=user_input,
    actual_output=actual_output,
    retrieval_context=retrieval_context
)
answer_relevancy_metric = AnswerRelevancyMetric()

# Evaluate
answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
print(answer_relevancy_metric.reason)
```

### Unit Testing with LlamaIndex

Unit testing LlamaIndex is as simple as defining an `EvaluationDataset` and generating `actual_output`s and `retrieval_context`s at evaluation time. Building upon the previous example:

```python test_llamaindex.py
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset, Golden

example_golden = Golden(input="What is LlamaIndex?")

dataset = EvaluationDataset(goldens=[example_golden])

@pytest.mark.parametrize(
    "golden",
    dataset.goldens,
)
def test_rag(golden: Golden):
    # LlamaIndex returns a response object that contains
    # both the output string and retrieved nodes
    response_object = rag_application.query(golden.input)

    # Process the response object to get the output string
    # and retrieved nodes
    if response_object is not None:
        actual_output = response_object.response
        retrieval_context = [node.get_content() for node in response_object.source_nodes]

    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [answer_relevancy_metric])
```

Here, `rag_application` is simply the query engine from Llamaindex, which you'll have to import from the appropriate module.

:::note
In this example, we initialized an `EvaluationDataset` with a `Golden` instead of an `LLMTestCase`. This is because we're generating `actual_output`s and `retrieval_context`s at evaluation time, meaning we cannot initialize with test cases since an `LLMTestCase` requires an `actual_output` to create. Remember, a `Golden` do not require an `actual_output`, so whilst test cases are always ready for evaluation, a golden isn't.
:::

## Using DeepEval from LlamaIndex

In LlamaIndex, there are entities known as evaluators that evaluates the responses of LlamaIndex applications. Continuing from the previous example, here's an alternative way to make use of the `AnswerRelevancyMetric` through `deepeval`'s LlamaIndex evaluators:

```python
from deepeval.integrations.llama_index import DeepEvalAnswerRelevancyEvaluator
...

# An example input to your RAG application
user_input = "What is LlamaIndex?"

# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = rag_application.query(user_input)

evaluator = DeepEvalAnswerRelevancyEvaluator()
evaluation_result = evaluator.evaluate_response(
    query=user_input,
    response=response_object
)
print(evaluation_result)
```

:::info
In LlamaIndex's documentation, you might see examples where the `evaluate()` method is called on an evaluator instead of the `evaluate_response()` method. While both is correct, you should **ALWAYS** use the `evaluate_response()` methods when using `deepeval`'s LlamaIndex evaluators.
:::

### Answer Relevancy

The `DeepEvalAnswerRelevancyEvaluator` uses `deepeval`'s `AnswerRelevancyMetric` for evaluation.

```python
from deepeval.integrations.llama_index import DeepEvalAnswerRelevancyEvaluator

evaluator = DeepEvalAnswerRelevancyEvaluator(
    # Optional. A float representing the minimum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
```

### Faithfulness

The `DeepEvalFaithfulnessEvaluator` uses `deepeval`'s `FaithfulnessMetric` for evaluation.

```python
from deepeval.integrations.llama_index import DeepEvalFaithfulnessEvaluator

evaluator = DeepEvalFaithfulnessEvaluator(
    # Optional. A float representing the minimum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
```

### Contextual Relevancy

The `DeepEvalContextualRelevancyEvaluator` uses `deepeval`'s `ContextualRelevancyMetric` for evaluation.

```python
from deepeval.integrations.llama_index import DeepEvalContextualRelevancyEvaluator

evaluator = DeepEvalContextualRelevancyEvaluator(
    # Optional. A float representing the minimum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
```

### Summarization

The `DeepEvalSummarizationEvaluator` uses `deepeval`'s `SummarizationMetric` for evaluation.

```python
from deepeval.integrations.llama_index import DeepEvalSummarizationEvaluator

evaluator = DeepEvalSummarizationEvaluator(
    # Optional. A float representing the minimum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
```

### Bias

The `DeepEvalBiasEvaluator` uses `deepeval`'s `BiasMetric` for evaluation.

```python
from deepeval.integrations.llama_index import DeepEvalBiasEvaluator

evaluator = DeepEvalBiasEvaluator(
    # Optional. A float representing the maximum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    model="gpt-4-0125-preview",
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
```

### Toxicity

The `DeepEvalToxicityEvaluator` uses `deepeval`'s `ToxicityMetric` for evaluation.

```python
from deepeval.integrations.llama_index import DeepEvalToxicityEvaluator

evaluator = DeepEvalToxicityEvaluator(
    # Optional. A float representing the maximum passing threshold, defaulted to 0.5.
    threshold=0.5,
    # Optional. A string specifying which of OpenAI's GPT models to use, defaulted to 'gpt-4o'.
    # Optional. A boolean which when set to `True`, will include a reason for its evaluation score, defaulted to `True`.
    include_reason=True
)
```

## Metrics vs Evaluators

While both `deepeval`'s metrics and evaluators yield the same result, `deepeval` is a full evaluation suite built specifically for LLM evaluation. Naturally, `deepeval` forces you to follow evaluation best practices, something not accomplishable through the use of the evaluators abstraction.

So while both metrics and evaluators can be used for a one-off, standalone evaluation, metrics:

- can be combined to evaluate multiple criteria asynchronously
- can be used to evaluate entire `EvaluationDataset`s
- can leverage `deepeval`'s native Pytest integration to unit test LlamaIndex applications in CI/CD pipelines
- can be used with any framework, meaning you are not vendor locked-in into LlamaIndex
- covers a wider range of evaluation criteria/use cases
- automatically integrates with [Confident AI](https://documentation.confident-ai.com), which offers evaluation analysis, evaluation debugging, dataset management, and real-time evaluations in production

:::note
The only upside of using `deepeval`'s LlamaIndex evaluators instead of metrics, is an evaluator automatically extracts the `retrieval_context` from a LlamaIndex response. However, as shown in previous examples, manually extracting the `retrieval_context` from a LlamaIndex response is extremely straightforward:

```python
...

# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = rag_application.query(user_input)

# Process the response object to get the output string
# and retrieved nodes
if response_object is not None:
    actual_output = response_object.response
    retrieval_context = [node.get_content() for node in response_object.source_nodes]
```

:::
